ABSTRACT


As the complexity of ASIC/SoC designs increases along with the number
of logic gates, prototyping in the verification stage becomes challenging
when the ASIC/SoC design cannot fit into a single FPGA.
One solution for prototyping a multi-million-gate ASIC/SoC circuit on
an FPGA platform for verification is to partition the design across
multiple FPGAs. Various implementation tools and platforms on the market
automate the FPGA-based prototyping phase, such as the
Cadence Protium Rapid Prototyping Platform, Synopsys and S2C. In this
paper, the Synopsys Protocompiler tool is used to prototype
a large 4-core CPU-based circuit on the HAPS-80 FPGA
platform. The paper focuses on the partition requirements that must be met to
successfully prototype the large SoC circuit on multiple FPGAs.
The presence of cut clocks in a circuit after the partition stage results in
failure in the routing stage due to congestion errors. In this paper,
two techniques are used to resolve the cut clocks: automatic clock
replication by the Synopsys Protocompiler tool and our proposed Manual
Clock Distribution technique, so that
the circuit meets the partition requirements and the
prototyping process on multiple FPGAs can be completed. The results obtained
with the proposed technique show that prototyping the large SoC circuit on the
multi-FPGA platform met the specification, with 100% of the
cut clocks eliminated.
1. INTRODUCTION

Nowadays, verification demands ever more effort as integrated design technologies are continuously
enhanced, leading to more complex and higher-performance SoC designs, which have become the
main type of electronic design in the semiconductor industry [1]. Design verification is an end-stage process
of ensuring that everything in the integrated design works as planned [2] and meets the customer requirements.
According to International Business Strategies (IBS) [3], the cost of developing a full design shows that the
verification stage is growing at an aggressive rate and accounts for the highest share of the overall design cost.
Because of the tremendous increases in the size, complexity, and cost of SoC designs at the verification stage,
software developers can no longer wait for the chip to be fabricated before hardware/software integration if
they are to meet the ever-shrinking time-to-market window. FPGA-based prototyping addresses
these challenges by prototyping an SoC circuit on an FPGA so that the design can be verified at the
pre-silicon stage [4]. Multi-FPGA prototyping is an option for verifying large designs because of its
high execution speed [5], [6], but it requires additional effort to partition the design in an optimized way [7].
Figure 1 shows a multi-FPGA prototyping flow using the Synopsys Protocompiler tool. The RTL of the
ASIC/SoC design must be reworked to meet the following FPGA-based requirements: top-level pads must be
adapted for the FPGA tool flow, gated clocks and complex generated clocks in the SoC/ASIC must be
transformed for FPGAs, and memories must be mapped to FPGA memory resources. Once the circuit
has been converted into an FPGA-based design, it is compiled and then goes through the pre-partition,
partition, system route and system generate stages. Once the FPGA-based SoC circuit has been successfully
partitioned, synthesis takes place for each partitioned FPGA, beginning with compile and continuing
with pre-map, map, and place and route using the Vivado tool.
Figure 1. Multi-FPGA prototyping flow


In the multi-FPGA prototyping flow, routing congestion above the maximum congestion level,
which is set to level 4, causes the partitioned FPGA design to fail [8]. A large SoC/ASIC
design must be partitioned across multiple FPGAs before the routing stage is reached [9]. During the FPGA
partitioning stage, several partition requirements must be met: zero unrouted nets,
zero cut clocks, a low number of feedthroughs, a low number of multi-hop nets, a minimum number of FPGA
interconnect nets, and a utilization below 65% for each FPGA [8]. This paper is therefore
devoted to satisfying the partition requirements in order to avoid the routing congestion problem.
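
The partition requirements listed above can be expressed as a simple check on the figures read from a partition report. The following Python sketch is illustrative only and is not part of the Protocompiler tool. The class and function names are hypothetical, and the limits for feedthroughs and multi-hop nets are placeholder assumptions, since the requirement only states that these counts should be low. The interconnect-net requirement is not checked.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionResult:
    """Figures read from a partition report for one iteration (hypothetical container)."""
    unrouted_nets: int
    cut_clocks: int
    feedthroughs: int
    multi_hop_nets: int
    utilization_pct: dict = field(default_factory=dict)  # FPGA name -> LUT utilization in %


def check_partition(result, max_feedthroughs=0, max_multi_hop=3, max_util_pct=65.0):
    """Return a list of violated requirements; an empty list means a clean partition.

    max_feedthroughs and max_multi_hop are assumed limits, not values
    taken from the tool.
    """
    violations = []
    if result.unrouted_nets != 0:
        violations.append(f"unrouted nets: {result.unrouted_nets} (must be 0)")
    if result.cut_clocks != 0:
        violations.append(f"cut clocks: {result.cut_clocks} (must be 0)")
    if result.feedthroughs > max_feedthroughs:
        violations.append(f"feedthroughs: {result.feedthroughs} (limit {max_feedthroughs})")
    if result.multi_hop_nets > max_multi_hop:
        violations.append(f"multi-hop nets: {result.multi_hop_nets} (limit {max_multi_hop})")
    for fpga, pct in sorted(result.utilization_pct.items()):
        if pct >= max_util_pct:
            violations.append(f"FPGA {fpga} utilization: {pct}% (must be below {max_util_pct}%)")
    return violations


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    example = PartitionResult(0, 2, 0, 1, {"A": 50.0, "D": 70.0})
    for violation in check_partition(example):
        print(violation)
```

A report is treated as clean only when the returned list is empty, which mirrors the iterative refinement used in this work: each partition run is repeated with new constraints until no requirement is violated.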

This paper is organized as follows. Section 2 describes existing techniques for solving the challenges
in the multi-FPGA prototyping flow. Section 3 discusses the steps implemented in this paper, from the
analysis of the failed design to the implementation of the adopted techniques. Section 4
summarizes the adopted techniques together with the results obtained.


2. LITERATURE REVIEW 

A considerable amount of research has addressed the challenges of the multi-FPGA prototyping
flow. This paper focuses on satisfying the partition requirements to avoid routing congestion.
This section discusses previous work devoted to the multi-FPGA flow.

When a large design does not fit in a single FPGA, a multi-FPGA prototyping flow is
required to split the design across multiple FPGAs. Most of the common problems that designers face
during multi-FPGA prototyping arise in the routing stage [10]. High congestion among inter-FPGA
signals causes the design to be unroutable [11].

The study in [10] found that inter-FPGA communication decreases the system frequency of
the prototyped design, and that the number of inter-FPGA signals and the critical path delay
are affected by the partitioning approach. The partitioner tool is therefore constrained so that
automatic design partitioning trades off the criteria that affect the system frequency.
An iterative routing algorithm is used to route the inter-FPGA signals; when the signals exceed the
available wires, multiplexing IPs are implemented in the sending and receiving FPGAs to transmit the
signals through the same physical wire. In [12], a new routability-driven routing approach is proposed
that gives better results than the approach used in [10]. In [13], the iterative routing algorithm proposed
in [14] is enhanced to route multi-terminal nets on multi-point tracks, in addition to routing cut nets on
two-point tracks, which saves FPGA input/output (I/O). In [15], the authors present a partitioning method
based on topological ordering and levelization. The results of these experiments emphasize the performance
of the design; this paper, however, addresses the problem that prevents the design from being
partitioned because of congestion errors.

In this paper, the automatic clock replication technique of the Synopsys Protocompiler tool is
applied to avoid routing congestion by eliminating the cut clocks. The other technique applied in this
paper manually redistributes the clock to the other FPGA through the HAPS global clock network,
using the partition constraint file (PCF) [8]. The two techniques are compared on their ability to meet the
partition requirements and resolve the cut nets that cause the design to be unroutable.



3. RESEARCH METHODOLOGY 
This section discusses the steps taken to satisfy the partition requirements in the multi-FPGA
prototyping flow. A design with multi-million logic gates is used in this paper.


3.1. Implementation Step 

Based on the multi-FPGA prototyping flow shown in Figure 1, a large design consisting of a 4-core
CPU-based circuit, named design_3, is partitioned across multiple FPGAs before being synthesized
at the individual FPGA stage. Most of the challenges in the multi-FPGA prototyping flow arise in the
partition stage, in meeting the partition requirements. This research work therefore concentrates on the
implementation steps needed to meet the partition requirements. Figure 2 shows the steps taken to partition
a design across multiple FPGAs.








[Figure 2: flowchart. Recoverable labels: "Check results after adding constraints. Analyze FPGA-to-FPGA
connectivity"; "Second partition: analyze unrouted nets, utilization, feedthrough and TDM"; "Abstract
partition iteration: analyze cut clocks, utilization and TDM ratio"; "Check for clean partition with all nets
routed. Look for acceptable cut clock count, feedthrough count and bin utilization"; "Analyze TDM ratio,
feedthrough and unrouted nets"; "System Route"; "System Generate".]

Figure 2. Methodology for partition





Before partitioning a design, two input files are prepared: the TSS file, which describes the
hardware setup, and the partition constraint file (PCF), which describes the partitioning constraints. The
Protocompiler tool then partitions the design according to the constraints defined in the PCF. The results of
each partition iteration are analyzed to ensure that the partition requirements are met.

In the initial stage, a run was executed with the default settings for the partition requirements, in which
the Protocompiler tool automates the partition stage before going on to the next stage, system route.
However, the design could not pass the routing stage because high congestion in an FPGA made the
design unroutable. The partition report shows that the partition stage failed because the
partition requirements were not met. An iterative process is therefore required to refine the partition result
until the requirements are satisfied.
Referring to Figure 2, in the initial partition stage the cable connectivity among the FPGAs is first
defined in the partition constraint file (PCF) before the nets are analyzed in the partition report.
The Global Route Summary section of the partition report is checked to ensure that no unrouted net is
reported. If an unrouted net is reported, the top-level port assignments, clock constraints and cell assignment
constraints need to be checked.

Figure 3 shows the first challenge faced in this research: the lookup table (LUT) utilization in
FPGA A is far higher than in FPGA D. The LUT utilization of each FPGA should be below 65% [8].
Figure 4 shows the partition requirements to be satisfied in the Global Route Summary section.


@S |Mapping Summary: 
Total LUTs: 2650254 (102%) 


@S |Mapping Summary:
Total LUTs: 225 (0%) 





Figure 3. Utilization report for FPGA A and FPGA D 


@S5.1.4 AP426 |Multi Hops
@S5.1.5 AP267 |Global Route Summary
Maximum TDM Ratio: 1
Feedthroughs: 0
Unrouted: 0





Figure 4. Global route summary 
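
The three values in the Global Route Summary can be collected automatically when many partition iterations are run. The Python sketch below is a minimal illustration that assumes the report contains lines of the form shown in Figure 4 (for example "Unrouted: 0"). The function name is hypothetical, and the exact layout of a real partition report may differ.

```python
import re

# Matches the three summary values shown in Figure 4, wherever they occur on a line.
SUMMARY_FIELD = re.compile(r"(Maximum TDM Ratio|Feedthroughs|Unrouted)\s*:\s*(\d+)")


def parse_global_route_summary(report_text):
    """Return a dict mapping each summary field found in the text to its integer value."""
    values = {}
    for line in report_text.splitlines():
        match = SUMMARY_FIELD.search(line)
        if match:
            values[match.group(1)] = int(match.group(2))
    return values


if __name__ == "__main__":
    sample = (
        "AP267 |Global Route Summary\n"
        "Maximum TDM Ratio: 1\n"
        "Feedthroughs: 0\n"
        "Unrouted: 0\n"
    )
    print(parse_global_route_summary(sample))
```

The search is not anchored to the start of the line, so a message identifier printed before the field name does not prevent a match.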


All partition requirements are satisfied except the total LUT utilization. The logic was therefore
divided manually among the bins according to the number of cores in design_3, through the PCF
file. The 4-core CPU-based design is split with 2 cores in each FPGA, and the main bus of the design is
placed in FPGA A using a PCF command. After the design is repartitioned according to
the new PCF constraints, the utilization problem is solved, but another problem arises:
cut clocks now exist, as shown in Figure 5. Figure 6 gives the details of the cut clocks, where a
clock crosses the FPGA boundary and causes a clock skew problem that results in poor timing
performance in the design.
Figure 5. Global route summary after the second partition
AP408 |Partitioner Estimate of Clock & Asynchronous Reset Crossings
Clock: FB1.uA->{FB1.uD} smca.cpu0.ckmclksrecore [3]
Clock: FB1.uA->{FB1.uD} smca.cpu0.ckatpgcfgshift_core_tin22 [0]
Clock: FB1.uA->{FB1.uD} smca.cpu0.ckmclksrecore [2]





Figure 6. Cut clock detail 
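
The clock crossing lines in Figure 6 follow a regular pattern: a source FPGA, a set of destination FPGAs in braces, the net name and a bit index. The Python sketch below extracts these fields so that the cut clocks can be listed for the next constraint iteration. It is an illustration under assumptions: the function name is hypothetical, and the separator between multiple destinations (comma or whitespace) is a guess, since the figure shows only one destination per line.

```python
import re

# Matches lines such as:
#   Clock: FB1.uA->{FB1.uD} smca.cpu0.ckmclksrecore [3]
CROSSING = re.compile(
    r"Clock:\s*(?P<src>\S+?)->\{(?P<dst>[^}]*)\}\s+(?P<net>\S+)\s*\[(?P<bit>\d+)\]"
)


def parse_cut_clocks(report_text):
    """Return one dict per clock crossing found in the report text."""
    crossings = []
    for match in CROSSING.finditer(report_text):
        # Assumption: several destinations would be separated by commas or spaces.
        destinations = [d for d in re.split(r"[,\s]+", match.group("dst")) if d]
        crossings.append(
            {
                "source": match.group("src"),
                "destinations": destinations,
                "net": f'{match.group("net")}[{match.group("bit")}]',
            }
        )
    return crossings


if __name__ == "__main__":
    sample = (
        "Clock: FB1.uA->{FB1.uD} smca.cpu0.ckmclksrecore [3]\n"
        "Clock: FB1.uA->{FB1.uD} smca.cpu0.ckmclksrecore [2]\n"
    )
    for crossing in parse_cut_clocks(sample):
        print(crossing["net"], crossing["source"], "->", crossing["destinations"])
```

The number of returned entries corresponds to the cut clock count that must reach zero before the partition is accepted.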


The Protocompiler tool's automatic clock replication feature was therefore enabled, through the PCF,
on the specific clock tree of each problematic clock. This feature is able to resolve the cut
clocks in the partitioned design. However, it introduces unrouted nets between the two FPGAs, as
shown in Figure 7. The alternative technique used in this research to avoid the cut clock problem is
therefore to redistribute the clock manually to the other FPGA through the HAPS global clock network,
using the PCF.
Figure 7. Unrouted nets reported


Figure 8 illustrates an example of the cut clocks that occur in the design during the FPGA
partition stage, where a clock from FPGA A crosses the boundary and connects to FPGA D. Two
cut clocks are shown in Figure 8: CLK1 and Gated-CLK1 both cross the FPGA boundary from
FPGA A to FPGA D.
Figure 8. Illustration of the cut clock in a design


Figure 9 presents the clock distribution technique for the cut clock issue illustrated in the
example in Figure 8. The HAPS-80 platform has a global clock network that is connected to every
FPGA. Therefore, CLK1, which crosses from FPGA A to FPGA D, is connected to
CLKC_SRC[1] and then routed back to FPGA D through GCLK[7].


Manual clock distribution technique in partitioning stage for multi-FPGA... (Salahuddin Savugathali)

642 ISSN: 2502-4752


[Figure 9: diagram. Recoverable labels: CLK_SRC[1]; GCLK[7]; XTAL clock; PLL/DCM; clock gating
generation circuit; Block 1; Block 2; Block 3; Gated-CLK2; FPGA A.]





Figure 9. Solution for cut clock 


Figure 10 shows the constraints defined in the TSS file to configure the GCLKs used in the design
to be prototyped. The TSS is a hardware specification file in which all the nets, clocks, and traces to be used
are specified in order to configure the HAPS-80 platform. Figure 11 and Figure 12 show the constraints defined
in the partition constraint file (PCF). After the HAPS platform has been configured, a PCF file must be defined to
partition the design according to the specification. Each problematic clock that causes a cut
clock issue in this design is therefore connected, as defined in the PCF file, to the global clock network that
reaches the other FPGA.


#configure your selected GCLK to source the input from the FPGA
################################################################
board_system_configure -clock FB1.GCLK7 FPGA
board_system_configure -clock FB1.GCLK? FPGA
board_system_configure -clock FB1.GCLK? FPGA

[A GCLK index shown as ? is illegible in the source figure.]

Figure 10. Constraint defined in the TSS file


################################################################################
#Define the functional group of internally generated clocks as GCLKs in the pcf
################################################################################
net_attribute {smca.cpu0.ckmclksrecore[3]} -function GCLK -diffsingle -is_clock 1
net_attribute {smca.cpu0.ckmclksrecore[2]} -function GCLK -diffsingle -is_clock 1
net_attribute {smca.cpu0.ckatpgcfgshift_core_tin22[3]} -function GCLK -diffsingle -is_clock 1

Figure 11. Constraint defined in the PCF file (1)


###############################################################
#Assign the internally generated clock to GCLK in the pcf file
###############################################################
assign_global_net {smca.cpu0.ckmclksrecore[3]} FB1.GCLK7
assign_global_net {smca.cpu0.ckmclksrecore[2]} FB1.GCLK?
assign_global_net {smca.cpu0.ckatpgcfgshift_core_tin22[3]} FB1.GCLK?

[A GCLK index shown as ? is illegible in the source figure.]

Figure 12. Constraint defined in the PCF file (2)
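
When several cut clocks must be redistributed, the two groups of PCF lines in Figures 11 and 12 can be generated from a single mapping of clock nets to global clock resources. The Python sketch below is a hedged illustration: the command spellings net_attribute and assign_global_net and their options are transcribed from the figures and have not been checked against the tool documentation, the function name is hypothetical, and the GCLK resources other than FB1.GCLK7 are placeholders chosen for the example.

```python
def build_clock_constraints(clock_to_gclk):
    """Return PCF-style lines for a mapping of clock net name -> global clock resource.

    Raises ValueError if two nets are mapped to the same global clock resource.
    """
    used = set()
    for gclk in clock_to_gclk.values():
        if gclk in used:
            raise ValueError(f"global clock resource assigned twice: {gclk}")
        used.add(gclk)

    lines = []
    # Declare every net as a global clock (form transcribed from Figure 11).
    for net in clock_to_gclk:
        lines.append(f"net_attribute {{{net}}} -function GCLK -diffsingle -is_clock 1")
    # Bind each net to its global clock resource (form transcribed from Figure 12).
    for net, gclk in clock_to_gclk.items():
        lines.append(f"assign_global_net {{{net}}} {gclk}")
    return lines


if __name__ == "__main__":
    mapping = {
        "smca.cpu0.ckmclksrecore[3]": "FB1.GCLK7",
        # The two resources below are placeholders, not values from the paper.
        "smca.cpu0.ckmclksrecore[2]": "FB1.GCLK8",
        "smca.cpu0.ckatpgcfgshift_core_tin22[3]": "FB1.GCLK9",
    }
    print("\n".join(build_clock_constraints(mapping)))
```

Rejecting duplicate resources guards against a simple mistake in a hand-written constraint file, where two different clocks are bound to the same global clock line.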


The results obtained at each implementation step in satisfying the partition requirements are
discussed in the next section.
4. RESULTS

For multi-FPGA prototyping, the partition requirements serve as the benchmark that must be met in order to
route the design successfully before it is synthesized at the individual FPGA level. Among the partition
requirements, the number of unrouted nets, the total number of cut clocks and the number of feedthroughs
in a design should all be zero. Multi-hop nets, which are FPGA-to-FPGA connections, should be below 3, and
the utilization of each FPGA should not exceed 65%.

In the first run of the experiment, the default flow was executed without any fixes, since the
Protocompiler tool provides a feature to auto-partition the design. However, this run hit an error at the
partition stage because the requirements were not achieved. The generated partition report was therefore
reviewed to check all the requirements for a successful partition.


4.1. The Result of Auto-Partitioning Using Protocompiler Tool 

Table 1 shows the results obtained using the auto-partition feature of the Protocompiler tool. Based
on the results recorded in the table, all the requirements are met except FPGA utilization. The logic
assigned to FPGA A exceeds 100% of its LUT capacity (102%), while FPGA D is almost unused (0%).
Because nearly all the logic sits in FPGA A, the other requirements are met: no clock crosses the FPGA
boundary and few multi-hop nets are required.


Table 1. Result of Partition Requirement Using Auto-Partition

Partition Requirement    Result
Unrouted nets            0
Cut clocks               0
Feedthrough              0
Multi-hop nets           1
FPGA utilization         FPGA A = 2650254 (102%)
                         FPGA D = 225 (0%)


4.2. The Result of Manual Partition on the First Iteration 

Table 2 presents the partition results after applying manual partitioning through the
constraints explained in the previous section. With manual partitioning, the utilization of each FPGA
is below 65%; however, another problem surfaces: cut clocks now exist,
which cause the partition stage to fail.


Table 2. The Result of Partition Requirement Using Manual Partition (1st Iteration)

Partition Requirement    Result
Unrouted nets            0
Cut clocks               3
Feedthrough              0
Multi-hop nets           3
FPGA utilization         FPGA A = 1460430 (56%)
                         FPGA D = 1143290 (44%)


4.3. The Result of Automatic Clock Replication on the Second Iteration

For the second iteration of the partition stage, the automatic clock replication technique of the
Protocompiler tool was applied to the specified clock trees. Table 3 shows the results for the second
iteration: the cut clocks are eliminated, but a more serious problem is introduced in the
routing stage, where unrouted nets now exist.


Table 3. The Result of Partition Requirement Using Automatic Clock Replication (2nd Iteration)

Partition Requirement    Result
Unrouted nets            3
Cut clocks               0
Feedthrough              0
Multi-hop nets           3
FPGA utilization         FPGA A = 1460430 (56%)
                         FPGA D = 1143290 (44%)
Figure 13 shows the congestion areas in an FPGA that cause the failure in the routing stage.
The congestion level should not exceed level 4. As seen in Figure 13, two
short congestion areas, in NORTH and WEST, exceed the maximum with levels 5
and 6.


[Figure 13: congestion plot. Recoverable labels: NORTH, SOUTH, EAST, WEST.]





Figure 13. Routing congestion level 
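
The congestion criterion can be stated as a comparison of each reported direction against the maximum level of 4. The short Python sketch below illustrates this. The function name is hypothetical, the pairing of level 5 with NORTH and level 6 with WEST assumes that the text lists them in the same order, and the SOUTH and EAST values are invented for the example.

```python
MAX_CONGESTION_LEVEL = 4  # maximum acceptable congestion level stated in the text


def congested_directions(levels, limit=MAX_CONGESTION_LEVEL):
    """Return the directions whose congestion level exceeds the limit."""
    return {direction: level for direction, level in levels.items() if level > limit}


if __name__ == "__main__":
    # NORTH and WEST follow the text (order assumed); SOUTH and EAST are hypothetical.
    levels = {"NORTH": 5, "SOUTH": 2, "EAST": 1, "WEST": 6}
    print(congested_directions(levels))
```

An empty result indicates that no direction exceeds the limit, which is the condition the partition must reach before routing can succeed.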


4.4. The Result of Manual Partition on the Final Iteration

The automatic clock replication technique applied earlier was therefore reverted, and the
newly proposed technique, which redistributes the clock among the FPGAs through the HAPS global clock
network, was used instead. The implementation steps for this technique are explained in Section 3.

Table 4 presents the final iteration of the partition stage after applying the proposed clock
redistribution technique. Apart from meeting the partition requirements, the design is also able to pass
the routing stage before it is synthesized at the individual FPGA level.


Table 4. Result of Partition Requirement Using Clock Distribution Techniques (Final Iteration)

Partition Requirement    Result
Unrouted nets            0
Cut clocks               0
Feedthrough              0
Multi-hop nets           3
FPGA utilization         FPGA A = 1460430 (56%)
                         FPGA D = 1143290 (44%)


5. CONCLUSION 

In this paper, two different techniques, automatic clock replication by the Synopsys
Protocompiler tool and our proposed Manual Clock Distribution technique, were applied
separately to eliminate the cut clocks so that the circuit meets the partition
requirements and the prototyping process on multiple FPGAs can be completed. A 4-core CPU-based SoC
design was used. The Manual Clock Distribution technique eliminated 100% of the cut clocks, whereas
the automatic clock replication technique created unrouted nets in the circuit. With the proposed
technique, the circuit is able to pass the routing stage and be prototyped on multiple FPGAs.